Abstract
Background: Myeloid neoplasms (MN) often originate in the bone marrow as premalignant clonal hematopoiesis (CH), the age-related expansion of a clonal population of hematopoietic cells carrying mutations in MN driver genes. Risk stratification of CH has facilitated clinical trials of therapeutic intervention in high-risk CH (HR-CH), with the hope of mitigating MN-associated mortality. However, because only ~1% of adult CH cases detected by common clinical next-generation sequencing (NGS) assays are HR-CH, the potential benefits of population-level screening for CH are currently outweighed by the risk of overdiagnosing low-risk CH. Strategies to identify populations enriched for HR-CH prior to NGS testing are needed to develop successful programs for early detection of CH and prevention of MN. Complete blood count (CBC) parameters are consistently cited among features prognostic of CH transformation to MN. We hypothesized that the molecular features of HR-CH would associate with CBC parameters, enabling derivation and validation of a machine learning model that classifies adults as likely or unlikely to have HR-CH before CH genotypes are known.
Methods: We studied a cohort of 461,620 adults aged 40-70 years without prior hematologic malignancy from the UK Biobank (UKB). CH, defined as the presence of at least one somatic mutation in genes associated with CH or MN at a variant allele fraction (VAF) ≥ 2%, was detected in 28,817 adults. Individuals were assigned a molecular risk score ranging from 0 to 7.5 based on previously established prognostic markers: number of mutations; VAF ≥ 20%; high-risk mutations (SF3B1, SRSF2, ZRSR2, RUNX1, IDH1/2, TP53, JAK2); or single DNMT3A mutation. We used molecular scores to assign three class labels representing distinct CH and MN outcomes: score < 4.5 as class 0 (negative), score ≥ 4.5 without incident MN as class 1 (indeterminate), and score ≥ 4.5 with incident MN as class 2 (positive). Using CBC parameters as model inputs, we trained a balanced random forest model to make class assignments on 80% of the cohort (n = 369,296), applying 5-fold stratified cross-validation for internal validation and tuning. The remaining 20% of the UKB cohort (n = 92,324) and an external clinical cohort (n = 2,466) were used to validate model performance. Additional validation using data from the Mass General Brigham Biobank is underway.
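The three-class labeling rule described in the Methods can be sketched as follows. This is a minimal illustration, not the study's code: the function name `assign_class` and the range check are hypothetical, and the sketch takes the molecular score as given because the per-marker weights are not stated in the abstract.

```python
# Minimal sketch of the class-labeling rule from the Methods.
# Assumption: the molecular score (0 to 7.5) has already been computed;
# the weights of the individual markers are not given in the abstract.

SCORE_THRESHOLD = 4.5  # cutoff stated in the Methods
MAX_SCORE = 7.5        # upper bound of the molecular score stated in the Methods


def assign_class(molecular_score: float, incident_mn: bool) -> int:
    """Return 0 (negative), 1 (indeterminate), or 2 (positive)."""
    if not 0.0 <= molecular_score <= MAX_SCORE:
        raise ValueError("molecular score must lie between 0 and 7.5")
    if molecular_score < SCORE_THRESHOLD:
        return 0  # negative, regardless of MN outcome
    # Score >= 4.5: positive only if an incident MN was observed.
    return 2 if incident_mn else 1


if __name__ == "__main__":
    for score, mn in [(0.0, False), (4.5, False), (6.0, True)]:
        print(score, mn, assign_class(score, mn))
```

Note that under this rule the incident-MN flag only separates classes 1 and 2; any individual scoring below 4.5 is labeled class 0.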
Results: Key model features included age, red cell distribution width (RDW), mean corpuscular volume (MCV), plateletcrit, giant platelets, hemoglobin concentration, and absolute blood counts. RDW, platelet morphology, and age had the greatest influence on model prediction. In the UKB validation cohort, the model achieved an AUC of 0.9 for classifying positive cases (class 2) and an AUC of 0.68 for negative cases (class 0).
We then compared our classifier's pre-NGS class predictions to post-NGS clonal hematopoiesis risk score (CHRS)-defined risk categories. The classifier labeled 100% of CHRS-defined HR-CH cases and 54.7% of intermediate-risk CH cases as positive (class 2), and 53.8% of all CHRS low-risk or CH-negative cases as negative (class 0). False negatives (missed MNs) were rare, with only 7 (6.3%) of 111 individuals with CH and incident MN misclassified as negative. Notably, all false negatives had normal blood counts and were CHRS low-risk. In the external clinical cohort, positive class sensitivity was 91% (AUC 0.75) and only 1 (2.2%) of 46 individuals with CH and incident MN was misclassified as negative.
Conclusions: We developed a random forest model that uses CBC parameters to identify populations enriched for HR-CH and validated its performance in population and clinical datasets. Overall model accuracy was moderate, reflecting a risk-averse approach that prioritizes sensitivity to minimize false negatives. This strategy is appropriate in clinical pre-screening contexts, where the aim is to capture all potentially at-risk individuals in the positive class while safely excluding, via negative classification, a substantial fraction of those with exceedingly low risk of incident MN. Although false positives occur, these individuals can be filtered out through follow-up NGS testing. False negative classifications were rare and occurred primarily in individuals with normal blood cell parameters, underscoring the need for future studies to identify novel latent morphological and clinical biomarkers of MN risk.
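The false-negative fractions quoted in the Results follow directly from the stated counts. The short check below is only an arithmetic illustration; the helper name `false_negative_pct` is hypothetical, and the only inputs are the counts given in the abstract (7 of 111 in the UKB validation cohort, 1 of 46 in the external clinical cohort).

```python
# Arithmetic check of the false-negative fractions reported in the Results.
# The helper name is hypothetical; the counts come from the abstract.


def false_negative_pct(missed: int, total_with_mn: int) -> float:
    """Percentage of individuals with CH and incident MN classified as negative."""
    if total_with_mn <= 0:
        raise ValueError("total_with_mn must be positive")
    if not 0 <= missed <= total_with_mn:
        raise ValueError("missed must lie between 0 and total_with_mn")
    return 100.0 * missed / total_with_mn


ukb_fn_pct = false_negative_pct(7, 111)      # UKB validation cohort
external_fn_pct = false_negative_pct(1, 46)  # external clinical cohort

if __name__ == "__main__":
    print(f"UKB validation: {ukb_fn_pct:.1f}%")
    print(f"External cohort: {external_fn_pct:.1f}%")
```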